fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context by sayakmaity · Pull Request #849 · scaleapi/llm-engine

sayakmaity · 2026-06-25T21:26:06Z

Problem

model-engine's Datadog env tag is derived from the build-time Helm value .Values.context, so every deployment reports the same env (e.g. production) regardless of which cluster/environment it actually runs in. This affects two independent surfaces:

Control-plane pods (gateway, builder, cacher, celery autoscaler) — DD_ENV and the tags.datadoghq.com/env label both come straight from .Values.context.
Runtime-launched inference endpoints — DD_ENV is baked into the rendered service_template_config_map from .Values.context at helm template time, so it's frozen in the image and identical for every cluster the image is deployed to (unlike the neighboring DD_SERVICE/DD_VERSION, which are already ${...} runtime-substituted).

Approach

Introduce a dedicated, optional datadog.env value, decoupled from context. context is overloaded: it also selects the service_template_config_map variant, drives a context == "circleci" conditional, and sets non-DD labels, so it must not be repurposed. The fix mirrors the existing GIT_TAG/DD_VERSION split exactly:

Control-plane → DD_ENV / env label = {{ .Values.datadog.env | default .Values.context }} (Helm value, backwards-compatible: falls back to context when unset).
Inference endpoints → DD_ENV = ${DD_ENV} runtime substitution, populated at endpoint-creation from the gateway's own DD_ENV (so launched pods inherit the gateway's per-cluster env). This follows the same pattern as ${DD_TRACE_ENABLED} / ${GIT_TAG}.
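
The two resolution paths above can be sketched in Python. This is only an illustration: `control_plane_dd_env` and `render_endpoint_env` are hypothetical helpers, the env names are made-up sample values, and `string.Template` is assumed as the substitution mechanism because the PR mentions `safe_substitute`.

```python
from string import Template
from typing import Optional


def control_plane_dd_env(datadog_env: Optional[str], context: str) -> str:
    """Emulate `{{ .Values.datadog.env | default .Values.context }}`.

    Helm's `default` treats an empty string as unset, so `or` is a close analogue.
    """
    return datadog_env or context


def render_endpoint_env(template_text: str, gateway_dd_env: str) -> str:
    """Fill the ${DD_ENV} placeholder at endpoint-creation time (hypothetical helper)."""
    return Template(template_text).safe_substitute(DD_ENV=gateway_dd_env)


# Sample values for illustration only.
unset_env = control_plane_dd_env("", "production")           # datadog.env left at ""
cluster_env = control_plane_dd_env("sgp-dev", "production")  # datadog.env set per cluster
rendered = render_endpoint_env('- name: DD_ENV\n  value: "${DD_ENV}"', cluster_env)
```

With datadog.env unset the control plane keeps reporting context, which is what makes the chart change backwards-compatible; launched pods only ever see whatever the gateway itself resolved.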

Changes

Chart (charts/model-engine)

values.yaml: add optional datadog.env (default "").
_helpers.tpl:
- baseLabels: tags.datadoghq.com/env → datadog.env | default context.
- serviceEnvBase: remove the hardcoded DD_ENV (moved into the wrappers, like GIT_TAG).
- serviceEnvGitTagFromHelmVar (control-plane): add DD_ENV = datadog.env | default context.
- serviceEnvGitTagFromPythonReplace (endpoints): add DD_ENV = ${DD_ENV}.
- baseServiceTemplateEnv / baseForwarderTemplateEnv (legacy endpoint paths): DD_ENV → ${DD_ENV}.
celery_autoscaler_stateful_set.yaml: $env → datadog.env | default context.
Chart.yaml: 0.2.6 → 0.2.7.

Server (model-engine)

common/env_vars.py: add DD_ENV (os.environ.get("DD_ENV") or infra_config().env).
infra/gateways/resources/k8s_resource_types.py: add DD_ENV to _BaseDeploymentArguments and pass it in all 11 deployment-argument constructors (alongside DD_TRACE_ENABLED).
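
A minimal sketch of the env_vars.py resolution, assuming the `or` form quoted above. `resolve_dd_env` and `build_time_env` are hypothetical names; `build_time_env` stands in for infra_config().env so the snippet runs on its own.

```python
import os
from typing import Callable, Mapping


def resolve_dd_env(environ: Mapping[str, str], fallback: Callable[[], str]) -> str:
    """Prefer the DD_ENV env var; otherwise use the build-time fallback.

    `or` (rather than `environ.get("DD_ENV", default)`) also falls back when
    DD_ENV is set but empty.
    """
    return environ.get("DD_ENV") or fallback()


def build_time_env() -> str:
    # Hypothetical stand-in for infra_config().env.
    return "production"


per_cluster = resolve_dd_env({"DD_ENV": "sgp-dev"}, build_time_env)
empty_var = resolve_dd_env({"DD_ENV": ""}, build_time_env)
missing_var = resolve_dd_env({}, build_time_env)

# In the real module the mapping would be the process environment.
DD_ENV = resolve_dd_env(os.environ, build_time_env)
```

Passing the fallback as a callable means it is only evaluated when DD_ENV is absent or empty.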

Validation

helm template (with values_sample.yaml): control-plane DD_ENV/label render to datadog.env when set and fall back to context when unset; inference DD_ENV renders to ${DD_ENV} for runtime substitution. No <no value> / unrendered {{ }}.
Both edited modules parse cleanly with python -m ast.
Substitution layer uses safe_substitute, and get_endpoint_resource_arguments_from_request now supplies DD_ENV, so ${DD_ENV} resolves at endpoint creation. The existing test_k8s_endpoint_resource_delegate.py tests render the real chart and assert structurally (no golden env-block comparison), so they remain compatible.
I did not run the full model-engine test suite locally (heavy service deps) — CI should confirm.

Out of scope / known limitations

The inference pod label tags.datadoghq.com/env (baseTemplateLabels) is intentionally left on context for now — that helper is shared with job templates, so making it ${DD_ENV} would require threading DD_ENV through the job argument classes too. The DD_ENV env var is the authoritative APM env source, so this is sufficient to fix env tagging on traces/metrics; the label can follow up.
Downstream: consumers that pin a specific chart version or bake the rendered service_template_config_map_<env>.yaml into an image (e.g. model-engine-internal's just autogen-templates + image build) must regenerate templates and rebuild the image to pick up ${DD_ENV}. After regeneration, the rendered config map's DD_ENV line is a runtime placeholder rather than a literal value.

Draft pending CI + downstream coordination.


sayakmaity · 2026-06-25T21:27:00Z

SGP consumer wiring that populates engine.datadog.env from the per-cluster info object: scaleapi/sgp#3730 (draft).


sayakmaity · 2026-06-26T16:37:23Z

Follow-up commit (ad9c421): tag gateway custom metrics with DD_ENV instead of infra_config().env.

DatadogMonitoringMetricsGateway tagged its statsd metrics with env:{infra_config().env} — the build-time deployment class, not the cluster's Datadog env. Switched to the new per-cluster DD_ENV. (celery_autoscaler.py already reads os.getenv("DD_ENV"), so it's covered by the chart change.) This pairs with scaleapi/sgp#3730, which removes a GCP-only infra.env=dev override; without this commit, that removal would have regressed GCP-dev custom metrics to env:production.


sayakmaity · 2026-06-26T19:10:40Z

Review follow-up (64db255): addresses both findings via central injection at the load_k8s_yaml substitution chokepoint.

Root cause confirmed for both:

P1: set_main_container_datadog_env re-derives the user container's DD_ENV from the pod label tags.datadoghq.com/env (k8s_endpoint_resource_delegate.py). That label (baseTemplateLabels) was still baked from context, so runnable-image endpoints lost the new env.
P2: ${DD_ENV} in $service_env is consumed by the batch-job templates (service_template_config_map.yaml:1124/1251), whose arg builders (live_batch_job_orchestration_gateway.py, live_docker_image_batch_job_gateway.py) never supply DD_ENV, so safe_substitute left the literal ${DD_ENV} in place.
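
The P2 failure mode is easy to reproduce with Python's `string.Template`, assuming that is what sits behind the `safe_substitute` call. The template text, the `RESOURCE_NAME` key, and the argument dict below are made up for illustration.

```python
from string import Template

batch_env_block = 'DD_ENV: "${DD_ENV}"\nDD_SERVICE: "${RESOURCE_NAME}"'

# A hypothetical batch-job argument dict that predates the change: no DD_ENV key.
batch_args = {"RESOURCE_NAME": "my-batch-job"}

# safe_substitute silently leaves unknown placeholders untouched.
rendered = Template(batch_env_block).safe_substitute(**batch_args)

# substitute is strict and raises KeyError for the missing name instead.
try:
    Template(batch_env_block).substitute(**batch_args)
    strict_error = None
except KeyError as exc:
    strict_error = exc.args[0]
```

Because `safe_substitute` never raises for a missing name, an argument builder that omits DD_ENV produces a manifest that still contains the placeholder, and nothing fails at render time.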

Fix: load_k8s_yaml is the single substitution chokepoint for every launched resource (endpoints, batch, cron, image-cache, all sub-resources). Inject DD_ENV (the gateway's per-cluster env from env_vars) there once: filtered_kwargs.setdefault("DD_ENV", DD_ENV). Then:

baseTemplateLabels label tags.datadoghq.com/env → ${DD_ENV} (so the delegate reads a per-cluster label for the user container), and batch log annotations → env:${DD_ENV}.
This covers labels, forwarder env, batch env, and log configs uniformly — no per-builder threading and no risk of a literal ${DD_ENV} in a k8s label (which would be rejected). I reverted the earlier per-constructor DD_ENV plumbing in k8s_resource_types.py (now net-zero) in favor of this.
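
A rough sketch of the central injection, under stated assumptions: `render_template` is a simplified stand-in for the substitution step inside load_k8s_yaml, the None-filtering rule is a guess at what filtered_kwargs means, and DD_ENV is a sample literal rather than the real env_vars import. The label check follows the Kubernetes label-value syntax (at most 63 characters, alphanumeric at both ends, with `-`, `_` and `.` allowed in between).

```python
import re
from string import Template
from typing import Any, Dict

# Stand-in for the gateway's per-cluster value from env_vars (sample literal).
DD_ENV = "sgp-dev"

# Kubernetes label-value syntax; the empty string is also valid.
LABEL_VALUE_RE = re.compile(r"(([A-Za-z0-9][-A-Za-z0-9_.]*)?[A-Za-z0-9])?")


def is_valid_label_value(value: str) -> bool:
    return len(value) <= 63 and LABEL_VALUE_RE.fullmatch(value) is not None


def render_template(template_text: str, **kwargs: Any) -> str:
    """Simplified stand-in for the substitution step inside load_k8s_yaml."""
    # Assumed filtering rule: drop arguments whose value is None.
    filtered_kwargs: Dict[str, Any] = {k: v for k, v in kwargs.items() if v is not None}
    # Inject once; a caller that passes DD_ENV explicitly still wins.
    filtered_kwargs.setdefault("DD_ENV", DD_ENV)
    return Template(template_text).safe_substitute(**filtered_kwargs)


label_template = "${DD_ENV}"
injected = render_template(label_template)
overridden = render_template(label_template, DD_ENV="staging")
```

Using `setdefault` rather than plain assignment keeps any explicitly supplied DD_ENV intact, so callers can still override the value while every other template gets it for free.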

Verified with helm template: control-plane labels/DD_ENV render datadog.env (Helm), inference labels/env/log render ${DD_ENV} (runtime), and substitution simulation resolves ${DD_ENV}→sgp-dev. Full model-engine test suite not run locally (mypy/CI to confirm).


sayakmaity · 2026-06-26T19:33:15Z

Round 3 review follow-up:

[P3] Regenerated the checked-in fallback template (5064a96). service_template_config_map_circleci.yaml is the default for LAUNCH_SERVICE_TEMPLATE_CONFIG_MAP_PATH (local/CI). Prod uses the Helm-mounted configmap via LAUNCH_SERVICE_TEMPLATE_FOLDER, and the unit tests render the chart fresh, so this is a local/CI consistency gap, not a prod bug. Regenerated from the updated gotemplate: 41 launched-pod tags.datadoghq.com/env labels → ${DD_ENV}, 14 DD_ENV env values → ${DD_ENV}, 3 log annotations → env:${DD_ENV}; the ConfigMap's own metadata label correctly stays baked (baseLabels). Heads-up: the file was already stale at chart 0.2.4, so regenerating at 0.2.7 also syncs 3 unrelated pre-existing drift lines (HPA Value→AverageValue, forwarder max_concurrency); these are not part of this PR's logic changes. (Regenerated with helm v4 plus a trailing-whitespace strip to match repo style; maintainers may re-run just autogen-templates with the pinned toolchain.)

[Note] Fixed the mutable self.tags leak (fa9ed71) in DatadogMonitoringMetricsGateway while here: _format_call_tags and emit_http_call_error_metrics did tags = self.tags (alias) then .extend(...), permanently appending per-call model_name/endpoint_name/error_code to the shared base list. Now copy via [*self.tags, ...].
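
The aliasing bug and its fix can be reproduced with a stripped-down pair of classes. `LeakyGateway` and `FixedGateway` are hypothetical stand-ins, not the real DatadogMonitoringMetricsGateway, and the tag values are samples.

```python
from typing import List


class LeakyGateway:
    """Reproduces the aliasing bug with a hypothetical, stripped-down class."""

    def __init__(self, base_tags: List[str]) -> None:
        self.tags = base_tags

    def format_call_tags(self, model_name: str) -> List[str]:
        tags = self.tags  # alias: same list object as self.tags
        tags.extend([f"model_name:{model_name}"])  # mutates the shared base list
        return tags


class FixedGateway(LeakyGateway):
    def format_call_tags(self, model_name: str) -> List[str]:
        # Build a new list per call; self.tags is left untouched.
        return [*self.tags, f"model_name:{model_name}"]


leaky = LeakyGateway(["env:sgp-dev"])
leaky.format_call_tags("model-a")
leaky_second = leaky.format_call_tags("model-b")  # still carries model-a's tag

fixed = FixedGateway(["env:sgp-dev"])
fixed.format_call_tags("model-a")
fixed_second = fixed.format_call_tags("model-b")  # only the base tag plus model-b
```

In the leaky version every emission grows the base list, so later metrics carry tags from earlier, unrelated calls.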

[Note] k8s_resource_types.py is intentionally net-zero — the earlier per-constructor plumbing was reverted in favor of the central load_k8s_yaml injection.

076adf0 fix: make Datadog env tag per-cluster (DD_ENV) instead of build-time context
ad9c421 fix: tag gateway custom metrics with DD_ENV (per-cluster) instead of infra_config().env
64db255 fix: resolve launched-pod DD_ENV per-cluster via central load_k8s_yaml injection

sayakmaity added 2 commits June 26, 2026 15:32

5064a96 chore: regenerate circleci service-template fallback for DD_ENV substitution
fa9ed71 fix: copy self.tags in DatadogMonitoringMetricsGateway to prevent tag leakage across emissions

sayakmaity wants to merge 5 commits into scaleapi:main from sayakmaity:sayakmaity/model-engine-dd-env-per-cluster
